
Here we are going to analyse the classical wine data, which gives different characteristics of wines for three different cultivars. We will apply different techniques from classical multivariate analysis, such as

  • Basic plotting and exploratory analysis of the raw data
  • Comparison among different types of confidence intervals/ellipsoids for the mean
  • Principal Component Analysis
  • Factor Analysis
  • Classification and Discrimination

1 Basic Plotting and Exploratory Analysis of the Raw Data

1.1 Summary Statistics

The columns of the data matrix are

##  [1] "cultivar"     "alchohol"     "malic.acid"   "ash"          "alka.ash"    
##  [6] "magnesium"    "phenol"       "flavanoid"    "nonflav.phen" "proant"      
## [11] "color.intens" "hue"          "OD/OD"        "proline"

So the data give measurements on 13 characteristics of three different types of wine. We first plot the data to see how the different variables behave group-wise.

We can see the first few lines of the dataset:

cultivar  alchohol  malic.acid  ash   alka.ash  magnesium  phenol  flavanoid  nonflav.phen  proant  color.intens  hue   OD/OD  proline
   1       14.23      1.71      2.43    15.6       127       2.80     3.06        0.28        2.29      5.64      1.04   3.92    1065
   1       13.20      1.78      2.14    11.2       100       2.65     2.76        0.26        1.28      4.38      1.05   3.40    1050
   1       13.16      2.36      2.67    18.6       101       2.80     3.24        0.30        2.81      5.68      1.03   3.17    1185
   1       14.37      1.95      2.50    16.8       113       3.85     3.49        0.24        2.18      7.80      0.86   3.45    1480
   1       13.24      2.59      2.87    21.0       118       2.80     2.69        0.39        1.82      4.32      1.04   2.93     735
   1       14.20      1.76      2.45    15.2       112       3.27     3.39        0.34        1.97      6.75      1.05   2.85    1450
Basic statistical summary for all the variables (cultivar counts: 1: 59, 2: 71, 3: 48):

Variable        Min      1st Qu.   Median    Mean     3rd Qu.   Max
alchohol       11.03     12.36     13.05     13.00    13.68     14.83
malic.acid      0.740     1.603     1.865     2.336    3.083     5.800
ash             1.360     2.210     2.360     2.367    2.558     3.230
alka.ash       10.60     17.20     19.50     19.49    21.50     30.00
magnesium      70.00     88.00     98.00     99.74   107.00    162.00
phenol          0.980     1.742     2.355     2.295    2.800     3.880
flavanoid       0.340     1.205     2.135     2.029    2.875     5.080
nonflav.phen    0.1300    0.2700    0.3400    0.3619   0.4375    0.6600
proant          0.410     1.250     1.555     1.591    1.950     3.580
color.intens    1.280     3.220     4.690     5.058    6.200    13.000
hue             0.4800    0.7825    0.9650    0.9574   1.1200    1.7100
OD/OD           1.270     1.938     2.780     2.612    3.170     4.000
proline       278.0     500.5     673.5     746.9    985.0    1680.0

1.2 Plotting the Data

Now we draw the pairwise plots among the variables

The relationships are not very clear. For this reason we redraw the plots group-wise, using different colours for the different groups.

All three pairwise plots show different patterns for the three groups, which are separated by different colours. So we get an idea that the measured characteristics may be very different across groups. From the graphs we see that two pairs of variables are fairly correlated: phenol & flavanoid, and ash & alka.ash in the 3rd group. Also the behaviour of malic.acid & alchohol is very different across groups. Overall the 3rd group has the least number of correlated variables.

First we include the boxplot of each variable

Above we have shown the boxplots of all 13 variables in the 3 different groups. We can clearly observe that the same variable has different mean and different variance across groups. Only the variables magnesium and ash can be thought of as having the same mean and variance across groups.

Also, from the estimated densities on the diagonal of the pairwise scatter plot, we can observe that the marginals are not normal.

2 Checking Multivariate Normality

Here we give several plots to check multivariate normality and build some intuition about the kind of data we have. For this we include different types of plots, such as

  • Univariate Q-Q plots
  • Univariate Histograms
  • Chi-square Q-Q plot
  • Bivariate density plot
  • Contour plot

2.1 For whole Data

First we give the Q-Q plots

More than half of the Q-Q plots show departure from normality.

We give the histograms below

In addition, the histograms suggest multimodality in the data.

We also give the chi-square Q-Q plot.

We also give some contour plots.

We can observe from the Q-Q plots that the variables are nowhere near normality. The histograms of each variable show that the distribution may be multimodal. The chi-square Q-Q plot also shows a lot of deviation from multivariate normality. These observations are enough to conclude that the data is not multivariate normal, but we provide the bivariate density plots and the contour plots to build some intuition about the data. From each of the densities and contours we can observe that the distribution is multimodal, more specifically trimodal. This is quite natural because we have data from three different groups, and there is a high possibility that each mode corresponds to one group. Another thing we can check is multivariate normality for each group.
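
The chi-square Q-Q plot above rests on the fact that, for multivariate normal data, the squared Mahalanobis distances of the observations from the sample mean are approximately \(\chi^2_p\) distributed. A minimal sketch of that computation in Python (the report's own analysis is in R, and simulated normal data stands in for the wine matrix):

```python
# Sketch of the chi-square Q-Q plot idea: for multivariate normal data the
# squared Mahalanobis distances roughly follow a chi-square(p) law.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
X = rng.multivariate_normal(mean=np.zeros(3), cov=np.eye(3), size=500)

xbar = X.mean(axis=0)
Sinv = np.linalg.inv(np.cov(X, rowvar=False))
# squared Mahalanobis distance of each row from the sample mean
d2 = np.einsum('ij,jk,ik->i', X - xbar, Sinv, X - xbar)

d2_sorted = np.sort(d2)
theo = stats.chi2.ppf((np.arange(1, 501) - 0.5) / 500, df=3)
# plotting d2_sorted against theo should hug the 45-degree line for normal data
```
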

2.2 Groupwise

First we give the Q-Q plots for group 1 (i.e. the 1st cultivar).

Again we see a lot of departure from normality. We see similar behaviour for the 2nd and 3rd cultivars.

Now we give the groupwise density plots

These densities show non-normality and strong heterogeneity among the cultivars.

Also we give the groupwise contours

The contours are nowhere close to concentric ellipses. What is worse, these plots suggest multimodality within the groups (so our previous guess that each mode corresponds to one group is not correct). This character makes the data intractable by classical multivariate methods.

All the plots suggest strong deviation from normality.

So we conclude that neither the whole data nor the group-wise data is anywhere close to normal.

3 Inference about Mean

Here we focus on the inference about the mean vector and various types of confidence intervals.

For performing tests and constructing confidence intervals, one of the most important assumptions is the normality of the multivariate observations. Since we have previously observed that the observations are nowhere close to multivariate normal, we try to find some subset of the variables that can be thought of as closely multivariate normal and perform the corresponding analysis.

3.1 Selection of variables

We first select a subset of variables that is close to multivariate normal. Since we have already observed that our data is not multivariate normal, we cannot base inference on all the variables.

We first give the Shapiro–Wilk test p-value for normality of each component:
Variable       p-value
alchohol        0.020
malic.acid      0.000
ash             0.039
alka.ash        0.264
magnesium       0.000
phenol          0.004
flavanoid       0.000
nonflav.phen    0.000
proant          0.014
color.intens    0.000
hue             0.017
OD/OD           0.000
proline         0.000
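
The table above comes from univariate Shapiro–Wilk tests on each column. A hedged sketch of the same computation in Python (the report uses R; the column names and data below are made-up stand-ins):

```python
# Illustrative only: Shapiro-Wilk test applied column by column, as in the
# p-value table above.  "normal_col" and "skewed_col" are simulated stand-ins.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = {
    "normal_col": rng.normal(size=178),       # a column with no strong evidence against normality
    "skewed_col": rng.exponential(size=178),  # a clearly non-normal column
}

# Shapiro-Wilk p-value per column; small p-values reject normality
pvals = {name: stats.shapiro(x).pvalue for name, x in data.items()}
```
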

As we can observe above, normality is not rejected at the 0.01 level of significance for the variables alchohol, ash, alka.ash, proant and hue. But from the Q-Q plots we observe that only the variables ash, alka.ash and proant look close to normal. So we consider these three variables and examine their joint multivariate normality.

Below we see that the three variables can be taken to be multivariate normal, but the three groups clearly do not have a homogeneous covariance matrix.

## 
##  Generalized Shapiro-Wilk test for Multivariate Normality by
##  Villasenor-Alva and Gonzalez-Estrada
## 
## data:  .
## MVW = 0.99086, p-value = 0.286
## 
##  Box's M-test for Homogeneity of Covariance Matrices
## 
## data:  wine_data %>% select(c("ash", "alka.ash", "proant"))
## Chi-Sq (approx.) = 48.238, df = 12, p-value = 2.841e-06

We remove the variable ash and perform the multivariate normality test and the homogeneity test again, to see that the data is bivariate normal but the groups still have non-homogeneous covariance matrices. This time the p-value of Box's M-test has increased a bit.

## 
##  Generalized Shapiro-Wilk test for Multivariate Normality by
##  Villasenor-Alva and Gonzalez-Estrada
## 
## data:  .
## MVW = 0.98941, p-value = 0.1635
## 
##  Box's M-test for Homogeneity of Covariance Matrices
## 
## data:  wine_data %>% select(c("alka.ash", "proant"))
## Chi-Sq (approx.) = 27.213, df = 6, p-value = 0.0001321

Since removing ash does not resolve the heterogeneity, we perform the analysis with all three variables ash, alka.ash and proant.

Although the assumption of homogeneous covariance matrices is not satisfied, we nevertheless perform some analyses that require it, since we cannot improve the situation further.

3.2 Constructing Simultaneous Confidence Interval for the Mean Vector

Here we deal only with the three variables mentioned above and assume that \(X_{1},X_{2},\dots,X_{n}\) are iid observations from \(N_{3}(\mu,\Sigma)\). We find simultaneous confidence intervals for \(\mu=(\mu_{1},\mu_{2},\mu_{3})'\) based on two methods:

  • Scheffe’s (Exact) simultaneous confidence interval
  • Large Sample simultaneous confidence interval

3.2.1 Scheffe’s (Exact) simultaneous confidence interval

To construct Scheffe’s confidence interval we use

\[a^{T}\bar{X}\pm\sqrt{\dfrac{(n-1)p}{n-p}F_{p,n-p}(\alpha)}\sqrt{\dfrac{a^{T}Sa}{n}}\]

So we calculate Scheffe’s simultaneous confidence intervals below.

##      ash      ash 
## 2.308139 2.424895 
## alka.ash alka.ash 
## 18.78432 20.20557 
##   proant   proant 
## 1.469107 1.712691

Simultaneous C.I. for mean \(\mu\) is

\[2.308\leq\mu_{1}\leq2.425\] \[18.784\leq\mu_{2}\leq20.206\] \[1.469\leq\mu_{3}\leq1.713\]
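
A sketch of the Scheffé computation with the formula above, taking \(a\) to be each coordinate direction \(e_i\) in turn. The data are simulated stand-ins (the means and spreads below are invented, not the real wine measurements):

```python
# Sketch: Scheffe's exact simultaneous confidence intervals for the component
# means, a = e_i for each coordinate.  Simulated data only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n, p, alpha = 178, 3, 0.05
X = rng.normal(loc=[2.37, 19.49, 1.59], scale=[0.27, 3.34, 0.57], size=(n, p))

xbar = X.mean(axis=0)
S = np.cov(X, rowvar=False)            # sample covariance (n-1 divisor)

# Scheffe critical constant: (n-1)p/(n-p) * F_{p, n-p}(alpha)
c2 = (n - 1) * p / (n - p) * stats.f.ppf(1 - alpha, p, n - p)

half_width = np.sqrt(c2) * np.sqrt(np.diag(S) / n)
lower, upper = xbar - half_width, xbar + half_width
```
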

3.2.2 Large Sample Simultaneous Confidence Interval

To construct Large Sample confidence interval we use

\[a^{T}\bar{X}\pm\sqrt{\chi_{p}^2(\alpha)}\sqrt{\dfrac{a^{T}Sa}{n}}\]

So we calculate the Large Sample confidence interval below

##      ash      ash 
## 2.309033 2.424000 
## alka.ash alka.ash 
## 18.79520 20.19468 
##   proant   proant 
## 1.470972 1.710825

Simultaneous C.I. for mean \(\mu\) is

\[2.309\leq\mu_{1}\leq2.424\] \[18.795\leq\mu_{2}\leq20.195\] \[1.471\leq\mu_{3}\leq1.711\]

Observe that both simultaneous C.I.s are of almost the same length. The large-sample one is a bit shorter, hence slightly better.
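
The difference between the two interval types comes only from the critical constants, since both share the factor \(\sqrt{a^{T}Sa/n}\). A quick illustrative comparison of the multipliers (not taken from the report):

```python
# The Scheffe and large-sample intervals differ only in the multiplier,
# so their length ratio is a single constant.
import math
from scipy import stats

n, p, alpha = 178, 3, 0.05
scheffe_c = (n - 1) * p / (n - p) * stats.f.ppf(1 - alpha, p, n - p)
chi2_c = stats.chi2.ppf(1 - alpha, p)

ratio = math.sqrt(scheffe_c / chi2_c)   # length(Scheffe) / length(large-sample)
```

For this \(n\) and \(p\) the ratio is only a percent or two above 1, matching the near-identical intervals observed above.
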

3.3 Confidence Ellipse

Here we will find confidence ellipse for \((\mu_{i},\mu_{j})\) based on the above two methods

  • Scheffe’s (Exact) Method
  • Large Sample Method

3.3.1 Scheffe’s Method

For Scheffe’s Method the \(100(1-\alpha)\%\) simultaneous confidence ellipse for \((\mu_{i},\mu_{k})\) is the sample-mean-centered ellipse

\[n(\bar{x_{i}}-\mu_{i},\bar{x_{k}}-\mu_{k})\begin{bmatrix}s_{ii} & s_{ik} \\s_{ik} & s_{kk}\end{bmatrix}^{-1}(\bar{x_{i}}-\mu_{i},\bar{x_{k}}-\mu_{k})'\leq\dfrac{(n-1)p}{n-p}F_{p,n-p}(\alpha)\]

Here we find the ellipse for the pair \((\mu_{2},\mu_{3})\) and plot it.

3.3.2 Large Sample Method

For the Large Sample Method the \(100(1-\alpha)\%\) simultaneous confidence ellipse for \((\mu_{i},\mu_{k})\) is the sample-mean-centered ellipse

\[n(\bar{x_{i}}-\mu_{i},\bar{x_{k}}-\mu_{k})\begin{bmatrix}s_{ii} & s_{ik} \\s_{ik} & s_{kk}\end{bmatrix}^{-1}(\bar{x_{i}}-\mu_{i},\bar{x_{k}}-\mu_{k})'\leq\chi_{p}^{2}(\alpha)\]

Here we find the ellipse for the pair \((\mu_{2},\mu_{3})\) and plot it.

We can observe that both are of a very similar shape.
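
One way to trace these mean-centered ellipses is via the eigen-decomposition of the 2×2 covariance block: the half-axis lengths are \(\sqrt{\lambda_j\,c^2/n}\) along the eigenvectors. A sketch (the covariance entries below are invented placeholders, not the real sample values):

```python
# Sketch: boundary of the large-sample confidence ellipse for (mu_i, mu_k),
#   n (xbar - mu)' S2^{-1} (xbar - mu) = chi2_p(alpha).
# The 2x2 block S2 and the means are assumed placeholder numbers.
import numpy as np
from scipy import stats

n, p, alpha = 178, 3, 0.05
xbar2 = np.array([19.49, 1.59])            # stand-in means for (alka.ash, proant)
S2 = np.array([[11.15, 0.30],              # assumed 2x2 covariance block
               [0.30,  0.33]])

c2 = stats.chi2.ppf(1 - alpha, p)          # cutoff keeps df = p, as in the formula
eigval, eigvec = np.linalg.eigh(S2)
half_axes = np.sqrt(eigval * c2 / n)       # half-axis lengths of the ellipse

t = np.linspace(0, 2 * np.pi, 200)
boundary = xbar2[:, None] + (eigvec * half_axes) @ np.vstack([np.cos(t), np.sin(t)])
```
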

3.3.3 Comparison of Exact and Large Sample Intervals

Here we plot both types of Confidence Ellipse i.e. the ellipses based on Large Sample Method as well as Scheffe’s Method to see if there is any significant difference between them.

Clearly both confidence ellipses almost coincide, with the large-sample ellipse having slightly smaller area than the exact one.

3.4 Bonferroni Confidence Interval

The Bonferroni \(100(1-\alpha)\%\) confidence interval for \(\mu_{i}\) is given by

\[\bar{x_{i}}\pm t_{n-1}\left(\frac{\alpha}{2p}\right)\sqrt{\frac{s_{ii}}{n}},\quad\text{for}\quad i=1,2,\dots,p.\]

We find the Bonferroni’s Simultaneous interval below

##      ash      ash 
## 2.316817 2.416216 
## alka.ash alka.ash 
## 18.88996 20.09993 
##   proant   proant 
## 1.487212 1.694586

Simultaneous C.I. for mean \(\mu\) is

\[2.317\leq\mu_{1}\leq2.416\] \[18.890\leq\mu_{2}\leq20.100\] \[1.487\leq\mu_{3}\leq1.695\]
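
A sketch of the Bonferroni computation (again with simulated stand-in data), placed next to the Scheffé multiplier for comparison:

```python
# Sketch: Bonferroni simultaneous intervals, xbar_i +/- t_{n-1}(alpha/2p) * sqrt(s_ii/n).
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n, p, alpha = 178, 3, 0.05
X = rng.normal(loc=[2.37, 19.49, 1.59], scale=[0.27, 3.34, 0.57], size=(n, p))

xbar = X.mean(axis=0)
se = X.std(axis=0, ddof=1) / np.sqrt(n)

t_mult = stats.t.ppf(1 - alpha / (2 * p), df=n - 1)
scheffe_mult = np.sqrt((n - 1) * p / (n - p) * stats.f.ppf(1 - alpha, p, n - p))

bonf_lower, bonf_upper = xbar - t_mult * se, xbar + t_mult * se
```

For small \(p\) the Bonferroni multiplier is smaller than Scheffé's, which is why its rectangle is the tightest of the three.
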

3.5 Comparison of all the Intervals

Now we overlay Bonferroni’s intervals on the previous plot for comparison.

In the above plot the dotted blue lines represent Bonferroni’s intervals, whereas the orange and black dotted lines give the large-sample and Scheffe’s intervals respectively. Clearly Bonferroni’s confidence rectangle is much smaller than Scheffe’s and the large-sample simultaneous ellipses. This result is not surprising: for a small number of components, Bonferroni’s critical value \(t_{n-1}(\alpha/2p)\) is smaller than Scheffe’s, which must protect against all linear combinations rather than just the \(p\) coordinates.

Similarly we can find the ellipses for other two pairs \((\mu_{1},\mu_{2})\) and \((\mu_{1},\mu_{3})\).

4 Principal Component Analysis (PCA)

In this part we focus on PCA (Principal Component Analysis) and its interpretation in explaining variance.

The main aim of PCA is to reduce the dimension of the raw data by finding a small number of variables which explain a fair amount of the variability present in the raw data.

4.1 Understanding the Intrinsic Dimension Through Plots

Since there are thirteen variables, it is very difficult to understand the correlation structure of the variables by looking at the 13-by-13 correlation matrix. So we give an image of the correlation matrix, where colours vary according to the correlation given in the colourbar. We can easily observe that almost all the correlations lie in the range \(-0.5\) to \(0.5\) and several of them are close to \(0\).

There are some 3 or 4 panels with similar types of colours. So, intuitively, there may be 3 or 4 (or more) principal components needed to explain a fair amount of variability. Since the main aim of PCA is to reduce the dimension of the raw data (in this case \(13\)), we may not gain much in reducing dimension.

To perform PCA we have to first decide whether to proceed with the raw data or the scaled data. We give below the column means and standard deviations:
Mean SD
alchohol 13.00 0.81
malic.acid 2.34 1.12
ash 2.37 0.27
alka.ash 19.49 3.34
magnesium 99.74 14.28
phenol 2.30 0.63
flavanoid 2.03 1.00
nonflav.phen 0.36 0.12
proant 1.59 0.57
color.intens 5.06 2.32
hue 0.96 0.23
OD/OD 2.61 0.71
proline 746.89 314.91

Clearly the variables have means and standard deviations of different orders. So if we perform PCA on the covariance matrix, some specific variables may get more weight than others, which is not desirable. So we decide to perform PCA on the correlation matrix.

Scatterplots do not say much about the dataset, and it is difficult to guess from them the intrinsic dimension that explains a fair amount of the variability.

4.2 Computation of PC

We perform the PCA now.

PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11 PC12 PC13
Standard deviation 2.169 1.580 1.203 0.959 0.924 0.801 0.742 0.590 0.537 0.501 0.475 0.411 0.322
Proportion of Variance 0.362 0.192 0.111 0.071 0.066 0.049 0.042 0.027 0.022 0.019 0.017 0.013 0.008
Cumulative Proportion 0.362 0.554 0.665 0.736 0.802 0.851 0.893 0.920 0.942 0.962 0.979 0.992 1.000
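
The table above amounts to an eigen-decomposition of the 13×13 correlation matrix. A minimal sketch of the computation in Python (random data stands in for the wine matrix; the original analysis is in R):

```python
# Sketch: PCA on the correlation matrix = eigen-decomposition of corr(X).
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(178, 13))                  # stand-in for the wine data

R = np.corrcoef(X, rowvar=False)                # 13 x 13 correlation matrix
eigval, eigvec = np.linalg.eigh(R)              # eigvec columns = PC loadings
eigval, eigvec = eigval[::-1], eigvec[:, ::-1]  # sort in decreasing order

sd = np.sqrt(eigval)                            # the 'Standard deviation' row
prop = eigval / eigval.sum()                    # 'Proportion of Variance' row
cumprop = np.cumsum(prop)                       # 'Cumulative Proportion' row
```
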

The first five components explain more than 80% of the variability.

So what we observe is that we need many dimensions (in this case at least three) to explain a fair amount of the variability. We plot the first two principal components to understand their interpretation, if we find any.

4.3 Interpretation of PC

First we observe the PC loadings below

PC1 PC2 PC3 PC4 PC5
alchohol -0.144 0.484 -0.207 0.018 -0.266
malic.acid 0.245 0.225 0.089 -0.537 0.035
ash 0.002 0.316 0.626 0.214 -0.143
alka.ash 0.239 -0.011 0.612 -0.061 0.066
magnesium -0.142 0.300 0.131 0.352 0.727
phenol -0.395 0.065 0.146 -0.198 -0.149
flavanoid -0.423 -0.003 0.151 -0.152 -0.109
nonflav.phen 0.299 0.029 0.170 0.203 -0.501
proant -0.313 0.039 0.149 -0.399 0.137
color.intens 0.089 0.530 -0.137 -0.066 -0.076
hue -0.297 -0.279 0.085 0.428 -0.174
OD/OD -0.376 -0.164 0.166 -0.184 -0.101
proline -0.287 0.365 -0.127 0.232 -0.158

Just by observing the loadings we cannot easily interpret the principal components. One reason for this is the excessive number of variables (in this case 13) available to us.

4.3.1 Separating into Groups by some Characteristic

For the purpose of interpretation we plot the principal component scores.

We first plot the principal component scores in two dimensions for the first three components.

From the graph it is highly plausible that the three clusters separated by the lines have different characteristics. In that case the principal components would explain the variability in those characteristics. A first guess for the characteristic is the grouping of the observations according to the three different cultivars. So we plot the principal component scores coloured by cultivar.

Now clearly the first two PCs explain the variability among the groups.

Now we plot the 2nd and 3rd PCs to see if there is anything meaningful we can find.

From the graph it is highly plausible that the two clusters separated by the line have different characteristics. In that case the second and third principal components would explain the variability in that characteristic. Since there are three different groups of cultivars, this variability between two clusters cannot be due to the cultivars alone. To see this we plot the principal component scores by cultivar for PC2 and PC3.

Clearly the 1st and 3rd cultivars overlap w.r.t. PC2 and PC3. But this picture shows that the 1st and 3rd groups may share some characteristic that is very similar, while the 2nd group differs from them in that characteristic. To find it, we combine the 1st and 3rd groups into a single group, treat the 2nd group as different, and plot the observations, grouped this way, against each variable.

All the above plots suggest that the only variable with the desired property (the 1st and 3rd groups similar, the 2nd group different) is alchohol. So we can say that the 2nd and 3rd principal components explain the variability in the alcohol content (an extremely important characteristic for a wine, probably the most important). So there are two clusters in the data, driven by the alcohol content of the wine.

So in this way principal components can be used to explore different patterns in the data.

4.3.2 Projection on the Plane

We now show graphically the variability of the projections of the observations along the first and second principal component axes.

Clearly we can observe that variability along the PC1 axis is much higher than variability along PC2 axis.

Now we provide below some PC plots to see the behaviour of the projections.

All the graphs show that the variability along the y-axis is at most the variability along the x-axis. The PC12 vs PC13 graph shows almost the same amount of variability along both axes, because PC12 and PC13 explain almost the same amount of variability. So these results are as expected.

5 Factor Analysis

Since we have several covariates here, it is difficult to infer anything about the wine quality taking into account all 13 covariates. So our objective is to determine a small number of hidden/latent factors which largely affect the wine quality.

Here our aim will be to

  • to exhibit a relationship between the measured and underlying hidden/latent variables
  • to estimate the hidden/latent variables

Here our model is

\[X-\mu=LF+\epsilon\]

where the symbols bear their usual significance. The assumptions are

  • \(\mathcal{E}(F)=0, \quad \mathcal{Cov}(F)=I_{m}\)
  • \(\mathcal{E}(\epsilon)=0 , \quad\mathcal{Cov}(\epsilon)=diag(\psi_{1},\dots,\psi_{p})\)
  • \(\mathcal{Cov}(\epsilon,F)=O_{p\times m}\)

Because the sample means and sample variances are of different orders, by similar logic as in PCA we use the correlation matrix for the factor analysis as well.

From the image of the correlation matrix in the PCA section we can say that there are at least 4 or 5 different correlation structures among the variables. So we can guess that we may need a higher-order factor model (\(k\geq 4\) in a \(k\)-factor model) to get some sort of interpretation of the hidden/latent factors.

We can estimate the factor loadings, the specific variances and hence the communalities using different methods, such as

  • Principal Component Method
  • Iterative PC Method
  • Maximum Likelihood Method

The first two methods are completely nonparametric (i.e. they make no distributional assumption on the observations), but the third method requires multivariate normality of the observations. We have already observed in section 2 that the observations are far from multivariate normal. Also we will see later that deleting outliers does not do anything special to make the observations multivariate normal. So we will not go into further detail on the maximum likelihood method of factor analysis. Rather we discuss the first two nonparametric methods in detail.

5.1 Deciding on the number of factors

Since we have \(p=13\) variables, the number of parameters in the covariance matrix is

\[\dfrac{p(p+1)}{2}=\dfrac{13\times14}{2}=91\]

If we use an \(m\)-factor model then the number of parameters in the model is \(p(m+1)\).

So, if \(m=5\), \(p(m+1)=13\times6=78\), and the model reduces the number of parameters. But if \(m\geq6\), \(p(m+1) \geq 91\), so the model does not help in dimension reduction; it is then better to base our inference on the covariance matrix itself.

So we will consider only the \(m-\text{factor models}\) with \(m\leq5\).
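
The parameter count above can be checked with a few lines of arithmetic (an illustrative check only):

```python
# Free parameters in a p x p covariance matrix vs an m-factor model:
# p(p+1)/2 for the symmetric covariance, p(m+1) for loadings + specific variances.
p = 13
cov_params = p * (p + 1) // 2                      # 13*14/2 = 91
factor_params = {m: p * (m + 1) for m in range(1, 7)}

# the m for which the factor model actually has fewer parameters
useful = [m for m in factor_params if factor_params[m] < cov_params]
```
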

Before proceeding to analyse the data with two of the above methods, we must decide upon how many factors to retain. We will mainly follow two methods for choosing the number of factors to retain.

  • Scree Plot
  • Interpretability

5.1.1 Scree Plot

This plots the proportion of variability explained by each eigenvalue of the correlation matrix of the observations, in decreasing order.

Clearly the first four or five eigenvalues are the dominant ones. So we fit 4-factor and 5-factor models.

5.1.2 Interpretability

Now, after fitting models with different numbers of factors based on the scree plot, we finally choose a particular factor model based on interpretability. Interpretability generally refers to the interpretation of the common factors: the model for which the interpretation of the common factors is clearer is preferred.

5.2 Principal Component Method

We first consider the principal component method of estimation of the loadings. We consider two variants:

  • without rotation
  • varimax rotation

5.2.1 Without rotation

5.2.1.1 4-Factor Model

The loadings for the unrotated 4-Factor Model are

PC1 PC2 PC3 PC4
alchohol 0.313 0.764 -0.249 -0.017
malic.acid -0.532 0.355 0.107 0.515
ash -0.004 0.499 0.753 -0.205
alka.ash -0.519 -0.017 0.736 0.058
magnesium 0.308 0.473 0.157 -0.337
phenol 0.856 0.103 0.176 0.19
flavanoid 0.917 -0.005 0.181 0.146
nonflav.phen -0.648 0.045 0.205 -0.195
proant 0.68 0.062 0.18 0.383
color.intens -0.192 0.837 -0.165 0.063
hue 0.644 -0.441 0.102 -0.41
OD/OD 0.816 -0.26 0.2 0.177
proline 0.622 0.577 -0.152 -0.222
PC loadings without rotation

We should note that the factor loadings are the correlations between the factors and the variables. For example, the correlation between the flavanoid and the first factor is about 0.917. Similarly, the correlation between ash and that factor is only about -0.004.

Interpreting factor loadings is similar to interpreting the coefficients for principal component analysis. We want to determine some inclusion criterion, which in many instances, may be somewhat arbitrary. In the above table, the values that we consider large are in red colour, using about .5 as the cutoff. The following statements are based on this criterion:

  • Factor 1 is correlated most strongly with flavanoid (0.917) and also correlated with phenol, nonflav.phen, proant, hue, OD/OD, proline, and to a lesser extent malic.acid and alka.ash. Here we can see that Factor 1 is associated with high levels of phenol, flavanoid, proant, hue, OD/OD and proline and low levels of malic acid, alka.ash and nonflav.phen. This distinguishes wines with high content of phenol, flavanoid, proant, hue, OD/OD and proline from wines with low content of malic acid, alka.ash and nonflav.phen.

  • Factor 2 is primarily related to alchohol, color.intens and proline. Here we can see that Factor 2 is associated with high levels of alchohol, color.intens and proline. You can say that the second factor is primarily a measure of these variables.

  • Factor 3 is primarily a measure of ash and alka.ash (by similar logic as before)

  • Factor 4 is primarily a measure of malic acid.

Note that magnesium is not related to any of the factors in 4-Factor Model.

5.2.1.2 5-Factor Model

The loadings for the unrotated 5-Factor Model are

PC1 PC2 PC3 PC4 PC5
alchohol 0.313 0.764 -0.249 -0.017 0.245
malic.acid -0.532 0.355 0.107 0.515 -0.033
ash -0.004 0.499 0.753 -0.205 0.132
alka.ash -0.519 -0.017 0.736 0.058 -0.061
magnesium 0.308 0.473 0.157 -0.337 -0.672
phenol 0.856 0.103 0.176 0.19 0.138
flavanoid 0.917 -0.005 0.181 0.146 0.101
nonflav.phen -0.648 0.045 0.205 -0.195 0.463
proant 0.68 0.062 0.18 0.383 -0.126
color.intens -0.192 0.837 -0.165 0.063 0.071
hue 0.644 -0.441 0.102 -0.41 0.16
OD/OD 0.816 -0.26 0.2 0.177 0.093
proline 0.622 0.577 -0.152 -0.222 0.146
PC loadings without rotation

Here the first four factors remain the same. Now magnesium is related to the fifth factor, which may be interpreted as a measure of magnesium.

Communalities for the 4-Factor model are

x
alchohol 0.7446016
malic.acid 0.6855860
ash 0.8587070
alka.ash 0.8149677
magnesium 0.4575148
phenol 0.8104849
flavanoid 0.8959272
nonflav.phen 0.5014181
proant 0.6447938
color.intens 0.7696000
hue 0.7876596
OD/OD 0.8044562
proline 0.7921530
Communalities

The results suggest that the factor analysis does the best job of explaining variation in flavanoid and ash.

One assessment of how well this model performs can be obtained from the communalities. We want to see values that are close to one. This indicates that the model explains most of the variation for those variables. In this case, the model does better for some variables than it does for others. The model explains flavanoid the best, and is not bad for other variables such as ash, alka.ash, phenol and OD/OD . However, for some other variables such as magnesium and nonflav.phen the 4-Factor model does not do a good job, explaining only about half of the variation.

Similarly we provide the communalities for the 5-Factor model below

x
alchohol 0.8048201
malic.acid 0.6866440
ash 0.8761608
alka.ash 0.8186960
magnesium 0.9085308
phenol 0.8295085
flavanoid 0.9060692
nonflav.phen 0.7153253
proant 0.6607752
color.intens 0.7745850
hue 0.8133776
OD/OD 0.8131878
proline 0.8134176
Communalities

One significant change that has happened with the 5-Factor model is that it now explains the variability in magnesium and nonflav.phen very well, which was not the case before.

5.2.2 Varimax Rotation

The problem with the previous analysis is that some of the variables are highlighted in more than one column. For instance, proline appears significant in both Factor 1 and Factor 2. The same is true for alka.ash in Factors 1 and 3. This does not provide a very clean, simple interpretation of the data. Ideally, each variable would appear as a significant contributor in only one column.

In fact the above table may indicate contradictory results. Looking at some of the observations, it is conceivable that we will find an observation that takes a high value on both Factors 1 and 4. If this occurs, a high value for Factor 1 suggests that the wine has low malic acid content, whereas the high value for Factor 4 suggests the opposite, that the wine has high malic acid content.

These points suggest that our previous interpretation was not that good. To remove some of the aforementioned defects we rotate the factor loadings with the help of orthogonal matrices. This is the varimax rotation.
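
Varimax searches over orthogonal rotations \(T\) for the one that maximizes the variance of the squared loadings within each column. A compact sketch of the standard SVD-based iteration (an illustrative implementation, not the exact routine used in this report; the stand-in loadings are random):

```python
# Sketch of varimax: rotate the loading matrix L by an orthogonal T so each
# variable loads highly on as few factors as possible.  Communalities (row
# sums of squared loadings) are unchanged by any orthogonal rotation.
import numpy as np

def varimax(L, gamma=1.0, max_iter=100, tol=1e-8):
    p, k = L.shape
    T = np.eye(k)
    obj = 0.0
    for _ in range(max_iter):
        Lr = L @ T
        u, s, vt = np.linalg.svd(
            L.T @ (Lr ** 3 - (gamma / p) * Lr @ np.diag((Lr ** 2).sum(axis=0))))
        T = u @ vt                      # best orthogonal rotation at this step
        if s.sum() < obj * (1 + tol):   # objective stopped improving
            break
        obj = s.sum()
    return L @ T, T

rng = np.random.default_rng(5)
L = rng.normal(size=(13, 5))            # stand-in for the estimated loadings
L_rot, T = varimax(L)
```
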

5.2.2.1 4-Factor Model

The loadings for the varimax rotated 4-Factor Model are

RC1 RC2 RC4 RC3
alchohol 0.162 0.813 -0.209 -0.119
malic.acid -0.161 -0.06 -0.79 0.178
ash 0.062 0.344 -0.041 0.857
alka.ash -0.215 -0.348 -0.236 0.769
magnesium 0.105 0.587 0.185 0.259
phenol 0.842 0.253 0.195 0.012
flavanoid 0.875 0.193 0.306 -0.009
nonflav.phen -0.583 -0.137 -0.15 0.346
proant 0.797 0.095 -0.012 0.002
color.intens -0.194 0.673 -0.526 0.047
hue 0.361 -0.059 0.808 -0.036
OD/OD 0.82 -0.072 0.355 -0.036
proline 0.342 0.798 0.176 -0.084
PC loadings with rotation

We can interpret the factors similarly to the unrotated principal component method. But the point to be noted is that, compared to the 4-factor unrotated PC model, the loading matrix makes much more sense here: every variable has a high loading on at least one factor (which was not the case previously). One little defect still remains: color.intens has a very high loading on factor 2 but simultaneously a sizeable negative loading on factor 4. This may give rise to contradictory results.

To remove this problem we consider 5-Factor rotated model to see if there is any improvement.

5.2.2.2 5-Factor Model

The loadings for the varimax rotated 5-Factor Model are

RC1 RC2 RC4 RC3 RC5
alchohol 0.148 0.876 -0.109 -0.011 0.058
malic.acid -0.146 0.006 -0.789 0.173 -0.112
ash 0.084 0.243 -0.032 0.886 0.155
alka.ash -0.182 -0.424 -0.294 0.721 -0.008
magnesium 0.098 0.229 0.053 0.148 0.907
phenol 0.836 0.268 0.24 0.031 0.026
flavanoid 0.868 0.189 0.339 -0.003 0.051
nonflav.phen -0.566 0.004 -0.075 0.444 -0.439
proant 0.795 0.055 -0.026 -0.042 0.155
color.intens -0.198 0.696 -0.472 0.118 0.122
hue 0.352 -0.087 0.825 -0.022 -0.021
OD/OD 0.816 -0.058 0.374 -0.048 -0.046
proline 0.324 0.775 0.246 -0.007 0.217
PC loadings with rotation

Now we see that the defects are removed from the loading matrix.

Our factors are:

Factor 1 distinguishes wines with high phenol, flavanoid, proant and OD/OD from wines with high nonflav.phen.

Factor 2 is a measure of alchohol, colour intensity and proline.

Factor 3 and Factor 5 are interpreted similarly to Factor 2.

Factor 4 is interpreted similarly to Factor 1.

Communalities remain the same as in the unrotated case.

5.3 Iterative PC Method

5.3.1 Unrotated Case

The loadings in the unrotated case are

PA1 PA2 PA3 PA4 PA5
alchohol 0.305 0.688 -0.262 0.167 0.154
malic.acid -0.472 0.283 0.053 0.126 -0.181
ash 0.006 0.519 0.738 0.122 0.204
alka.ash -0.488 0.007 0.614 0.062 -0.093
magnesium 0.321 0.526 0.175 -0.755 -0.14
phenol 0.845 0.087 0.119 0.193 -0.137
flavanoid 0.937 -0.018 0.152 0.18 -0.129
nonflav.phen -0.584 0.044 0.153 0.112 0.212
proant 0.623 0.044 0.087 0.058 -0.254
color.intens -0.182 0.763 -0.184 0.144 -0.032
hue 0.646 -0.445 0.127 -0.16 0.472
OD/OD 0.796 -0.256 0.161 0.12 -0.099
proline 0.61 0.536 -0.175 0.021 0.251
PA loadings without rotation

We can observe that, for many variables, there are significant loadings on two factors corresponding to the same variable. Also the last factor has no high correlation with any of the variables, so it seems useless.

5.3.2 Varimax Rotation

The loadings in the rotated case are

PA1 PA2 PA5 PA3 PA4
alchohol 0.148 0.803 -0.121 0.006 0.076
malic.acid -0.273 0.017 -0.495 0.18 -0.042
ash 0.064 0.233 -0.043 0.89 0.138
alka.ash -0.226 -0.383 -0.251 0.606 -0.025
magnesium 0.149 0.215 0.009 0.103 0.96
phenol 0.83 0.244 0.203 0.031 0.032
flavanoid 0.909 0.179 0.302 0.015 0.023
nonflav.phen -0.523 -0.079 -0.14 0.295 -0.195
proant 0.661 0.086 0.068 -0.034 0.121
color.intens -0.16 0.614 -0.496 0.116 0.094
hue 0.313 -0.046 0.883 -0.033 0.011
OD/OD 0.773 -0.046 0.383 -0.037 -0.034
proline 0.339 0.75 0.198 -0.009 0.191
a PA Loading with rotation

The result is very similar to the 5-Factor varimax-rotated model obtained with the PC method.

The communalities for the 5-Factor model are

Communality
alchohol 0.687
malic.acid 0.354
ash 0.871
alka.ash 0.628
magnesium 1.000
phenol 0.792
flavanoid 0.950
nonflav.phen 0.424
proant 0.465
color.intens 0.671
hue 0.880
OD/OD 0.749
proline 0.754
a Communalities

We observe a lot of inconsistency in the explained variance compared to the PC method.

So we can conclude that the interpretation is much clearer if we adopt the PC method with 5 factors.

5.4 Factor Scores

Factor scores are estimates of the unobservable random vectors \(\mathbf{F_j}\) (one for the \(j^{th}\) sample). After estimating \(\mathbf{L}\) we proceed as if \(\mathbf{L}\) and \(\psi=diag(\psi_1,\psi_2, \dots, \psi_p)\) were known and estimate \(\mathbf{F}\) using the weighted least squares method. We will be using the rotated loadings to compute the factor scores. We could use different estimates of \(\mathbf{L}\) in this computation; however, as we have seen that our data cannot be considered as coming from a multivariate normal distribution, we have neither performed MLE to estimate the loadings nor will we consider factor scores based on the Regression Method.
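Concretely, the WLS (Bartlett) estimate is \(\hat{\mathbf{f}}_j=(\mathbf{L}'\psi^{-1}\mathbf{L})^{-1}\mathbf{L}'\psi^{-1}(\mathbf{x}_j-\overline{\mathbf{x}})\). A minimal Python sketch with a hypothetical loading matrix (illustrative only, not the report's R code):

```python
import numpy as np

def bartlett_scores(Z, L):
    """Bartlett's weighted-least-squares factor scores.

    Z : (n, p) standardized observations; L : (p, k) loading matrix.
    Uses specific variances psi_i = 1 - h_i^2 and computes
    f_j = (L' Psi^-1 L)^-1 L' Psi^-1 z_j for each row z_j of Z.
    """
    psi = 1.0 - (L**2).sum(axis=1)       # specific variances
    Lw = L / psi[:, None]                # Psi^{-1} L
    A = np.linalg.solve(L.T @ Lw, Lw.T)  # (L' Psi^-1 L)^-1 L' Psi^-1
    return Z @ A.T

# Hypothetical loadings and standardized data (illustration only)
L = np.array([[0.8, 0.2], [0.7, 0.3], [0.2, 0.8], [0.3, 0.7]])
rng = np.random.default_rng(1)
Z = rng.standard_normal((10, 4))
F = bartlett_scores(Z, L)
print(F.shape)  # (10, 2)
```

A useful property of this estimator is that the weighting matrix satisfies \((\mathbf{L}'\psi^{-1}\mathbf{L})^{-1}\mathbf{L}'\psi^{-1}\mathbf{L}=\mathbf{I}\), so the scores are conditionally unbiased for the factors.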

Principal Component Solution:- For our subsequent analysis we will be using the \(5\)-factor model. The first \(10\) factor scores obtained by the weighted least squares method are presented below:-

1st Factor Scores 2nd Factor Scores 3rd Factor Scores 4th Factor Scores 5th Factor Scores
0.9736686 0.8939429 0.3226787 -0.3334916 1.5928260
0.3044046 0.8029998 0.8608030 -1.5361830 -0.0670229
1.3439035 0.6225607 -0.0906149 0.7000897 -0.1548887
1.6824823 1.8853296 -0.4802619 -0.0595442 0.3895252
0.6061956 -0.0183083 0.2735470 1.6329013 0.9539300
0.7265095 1.9569533 0.6128587 -0.1570641 0.2956971
0.4320040 1.6574377 0.5923118 -0.3688856 -0.6509567
0.1276245 1.2113931 0.9395261 0.4539266 1.0982265
0.5096427 1.6153080 0.4629022 -1.1617836 -0.5600603
1.0696107 1.0827465 0.0730891 -0.7437799 -0.3470467

The sample mean vector and covariance matrix of these Bartlett’s WLS scores are shown below:-

## [1]  7.831564e-17 -3.041120e-16  2.971701e-16  5.683942e-16  5.694802e-18
##              [,1]         [,2]          [,3]          [,4]         [,5]
## [1,]  1.019439409 -0.003705815 -0.0148833008  0.0016863366 -0.003409966
## [2,] -0.003705815  1.003366743  0.0042170598  0.0051667393 -0.007440983
## [3,] -0.014883301  0.004217060  1.0206407199  0.0008374363  0.006302309
## [4,]  0.001686337  0.005166739  0.0008374363  1.0141180919 -0.019602610
## [5,] -0.003409966 -0.007440983  0.0063023086 -0.0196026103  1.045487934

The components of the sample mean vector have very small magnitudes, so it can be considered a zero vector. Similarly, the diagonal elements of the covariance matrix are close to \(1\) and the off-diagonal entries are negligible. As these serve as estimates of the population parameters, we can say that the factor scores satisfy the assumptions of zero mean, zero correlation (or covariance) between the factors, and unit variance. To check for pairwise independence we have constructed the pairwise scatter plots to see if there is any pattern. The plots are shown below:-

We do not see any specific pattern in the pairwise plots. The sample correlation coefficients between any two pairs are also very close to \(0\). So we can say that there is no linear or other type of dependency between any two factor scores. We can also check the normality assumption of the factor scores, though we do not expect them to show behaviour close to normality: had they followed a normal distribution we would have expected our data to be normal too, which is not the case. The multivariate Shapiro-Wilk test has the following result:-

## 
##  Shapiro-Wilk normality test
## 
## data:  Z
## W = 0.87358, p-value = 4.396e-11

We can see that the \(p\) value is small. Let us have a look at the chi-square \(Q-Q\) plot:-

The \(Q-Q\) plot also shows a sharp deviation from the straight line towards the right end. We can see the univariate \(Q-Q\) plots:-

Only the third and the fourth factor scores show a pattern which can be deemed approximately normal. The third and perhaps the fifth factor scores also indicate the presence of outliers. We will discuss them in the next section.

Iterative Principal Factor Method:- When we estimate the loadings by the iterative method, the first few factor scores obtained by the WLS method are shown below:-

1st Factor Scores 2nd Factor Scores 3rd Factor Scores 4th Factor Scores 5th Factor Scores
1.0207893 0.8081754 0.1419373 -0.3723471 1.6882567
0.5004158 0.8111785 0.5298172 -1.4247612 -0.0915397
1.1806827 0.7598717 0.0574675 0.9070406 -0.3597640
1.7449162 1.9439934 -0.9014352 -0.1382616 0.2846431
0.6447161 -0.1955163 0.2395621 1.8114450 1.0784466
0.9264708 2.0604763 0.3076251 -0.2507408 0.3136775
0.1753375 1.8924860 0.6078253 -0.2412000 -0.7033209
0.0887231 1.3207413 0.8068105 0.4407784 1.1866158
0.5315434 1.7257771 0.4585468 -1.2027919 -0.5431397
1.1312759 1.1346829 -0.1143892 -0.7302326 -0.4766487

The sample mean vector and covariance matrix of these scores are as follows:-

## [1]  5.426858e-17 -3.166967e-16  2.798070e-16  8.245412e-16 -7.023681e-17
##              [,1]        [,2]         [,3]         [,4]         [,5]
## [1,]  1.091016064 -0.06407169 -0.089427241  0.008213658 -0.005514678
## [2,] -0.064071688  1.19844833  0.053749006 -0.022014250 -0.030929039
## [3,] -0.089427241  0.05374901  1.180401248  0.009205417  0.005072287
## [4,]  0.008213658 -0.02201425  0.009205417  1.142896730 -0.009723812
## [5,] -0.005514678 -0.03092904  0.005072287 -0.009723812  1.008681814

The sample mean vector can still be considered a zero vector. The covariance matrix, though its entries show a slight increase, still appears more or less close to an identity matrix. Next we show the pairwise scatter plots:-

The pairwise plots do not indicate any relationship between any two factor scores. When we check for normality, the Shapiro-Wilk test has the following result:-

## 
##  Shapiro-Wilk normality test
## 
## data:  Z
## W = 0.94156, p-value = 1.17e-06

The \(p\) value has decreased, and the test rejects the hypothesis. The same thing appears in the chi-square \(Q-Q\) plot:-

The univariate \(Q-Q\) plots are as follows:-

As we can see, not all the factor scores conform to normality.

Based on the above discussion we can thus say that the factor scores comply with all the assumptions except normality.

6 Classification and Discrimination

6.1 Towards Achieving Normality

In one of our earlier sections dealing with data visualization we have already seen that the \(178\) sample points on \(13\) variables (the combined data, ignoring the groups) do not seem to come from a multivariate normal distribution. This is evident from both the plots and the hypothesis tests. The deviation from normality appears in the univariate \(Q-Q\) plots themselves, so the data will be far from normality in higher dimensions too.

This is not uncommon, as most of the datasets we encounter in practice appear to be non-normal. On the other hand, most theoretical techniques are valid only under a distribution which is more or less close to the ideal normal distribution. If the techniques are sensitive to the assumption of normality, then applying them to such a data set might give erroneous results. In such a situation, the best we can do is to find ways to deal with the non-normal data. Our next part deals with finding suitable sample discriminants to effectively classify an observation into the population (class) where it belongs. The techniques we have encountered for discrimination and classification either assume multivariate normality independently for each population or require the true density of the observations. Fortunately, Fisher’s Discriminant Analysis is not built upon the assumption of multivariate normality. But it assumes homogeneity of covariance matrices across the populations, that is \(\Sigma_1=\Sigma_2=\Sigma_3\). The homogeneity can be checked by Box’s M test. It is a generalisation of Bartlett’s test of homogeneity and is a modification of the LRT. The test statistic asymptotically follows a \(\chi^2\) distribution. However, this test itself assumes normality, as it is essentially a likelihood ratio test. Still, if we apply this test to our data we have the following result:-
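For reference, a sketch of how Box's M statistic and its chi-square approximation can be computed (Python, with simulated stand-in data; the report uses the R implementation). With \(g=3\) groups and \(p=13\) variables, the degrees of freedom are \((g-1)p(p+1)/2=182\), matching the R output.

```python
import numpy as np
from scipy import stats

def box_m(groups):
    """Box's M test for equality of covariance matrices.

    groups : list of (n_i, p) arrays, one per population.
    Returns the chi-square approximation of the statistic,
    its degrees of freedom, and the p-value.
    """
    g = len(groups)
    p = groups[0].shape[1]
    ns = np.array([x.shape[0] for x in groups])
    covs = [np.cov(x, rowvar=False) for x in groups]
    Sp = sum((n - 1) * S for n, S in zip(ns, covs)) / (ns.sum() - g)
    M = (ns.sum() - g) * np.log(np.linalg.det(Sp)) \
        - sum((n - 1) * np.log(np.linalg.det(S)) for n, S in zip(ns, covs))
    # Box's scale correction factor for the chi-square approximation
    c = (sum(1.0 / (n - 1) for n in ns) - 1.0 / (ns.sum() - g)) \
        * (2 * p**2 + 3 * p - 1) / (6.0 * (p + 1) * (g - 1))
    df = (g - 1) * p * (p + 1) // 2
    chi2 = (1 - c) * M
    return chi2, df, stats.chi2.sf(chi2, df)

rng = np.random.default_rng(2)
groups = [rng.standard_normal((60, 13)) for _ in range(3)]
chi2, df, pval = box_m(groups)
print(df)  # (3-1)*13*14/2 = 182
```

By the concavity of the log-determinant, \(M\ge 0\), with equality when all group covariance estimates coincide.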

## 
##  Box's M-test for Homogeneity of Covariance Matrices
## 
## data:  wine[, 2:14]
## Chi-Sq (approx.) = 684.2, df = 182, p-value < 2.2e-16

The \(p\) value is really very small. Because of the inherent non-normality and non-homogeneity, we will be able to use neither of the methods on our original data set. Perhaps the techniques adopted below will help. All the techniques which we are going to use try to make the combined data (ignoring the groups) normal. Once we achieve that, normality for the individual groups will follow automatically.

6.1.1 Detection of outliers and cleaning the data

Outliers are observations which appear to be different from the mass of the data. Though they are uncommon data points, they still form a part of the data and add to our information. However, if we have a large number of observations, and if very few of the points depart from the usual pattern, then it is better to discard those points and do our analysis with the reduced dataset. Though this is not always useful, we can at least expect some improvement in our results.

In multivariate analysis there are several methods of detecting outliers. One way is to look for outliers in each of the variables. While this method helps in finding univariate outliers, it may not give any information about outliers in multiple dimensions. In fact we will later see that some of the detected outliers in our dataset are not univariate outliers. However, we will be using techniques to detect both univariate and multivariate outliers. For detecting univariate outliers, we will be looking for large values of the standardized values \(z_{ik}=\frac{x_{ik}-\overline{x_k}} {\sqrt{s_{kk}}}\) for \(i=1,2, \dots, 178\) and each column \(k=1,2, \dots, 13\). Johnson & Wichern (1) mention that observations with \(|z_{ik}|>3.5\) can be treated as large. So we have standardized the observations, and some of them are shown below:-
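As a sketch of this screening step (Python, with simulated stand-in data rather than the actual wine matrix), standardize each column and flag entries with \(|z_{ik}|>3.5\):

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical stand-in for the 178 x 13 wine matrix (illustration only)
X = rng.standard_normal((178, 13))
X[59, 2] = 6.0   # plant an artificial extreme value (row 60, column 3)

# z_ik = (x_ik - mean_k) / sd_k, computed column by column
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Flag entries with |z| > 3.5 as univariate outliers
rows, cols = np.where(np.abs(Z) > 3.5)
print(list(zip(rows + 1, cols + 1)))   # 1-based (observation, variable) pairs
```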

Alcohol Malic acid Ash Alkalinity of ash Magnesium Total phenols Flavanoids Nonflavanoid phenols Proanthocyanins Color intensity Hue OD/OD Proline
1.5143408 -0.5606682 0.2313998 -1.1663032 1.9085215 0.8067217 1.0319081 -0.6577078 1.2214385 0.2510088 0.3611585 1.8427215 1.0101594
0.2455968 -0.4980086 -0.8256672 -2.4838405 0.0180940 0.5670481 0.7315653 -0.8184106 -0.5431887 -0.2924962 0.4049085 1.1103172 0.9625263
0.1963252 0.0211715 1.1062139 -0.2679823 0.0881098 0.8067217 1.2121137 -0.4970050 2.1299594 0.2682629 0.3174085 0.7863692 1.3912237
1.6867914 -0.3458351 0.4865539 -0.8069748 0.9282998 2.4844372 1.4623994 -0.9791134 1.0292513 1.1827317 -0.4263410 1.1807407 2.3280068
0.2948684 0.2270533 1.8352256 0.4506745 1.2783790 0.8067217 0.6614853 0.2261576 0.4002753 -0.3183774 0.3611585 0.4483365 -0.0377675
1.4773871 -0.5159113 0.3043010 -1.2860793 0.8582840 1.5576991 1.3622851 -0.1755994 0.6623487 0.7298108 0.4049085 0.3356589 2.2327407

A search for large values in each of the above columns indicates the following observations: the \(60^{th}\) observation of the variable “Ash”, and the \(70^{th}\) and \(96^{th}\) observations of “Magnesium”. These observations exceed our limit of \(3.5\), so they are univariate outliers in the respective variables. These outliers appear in the boxplots which we have shown in our first section.

Multivariate outliers are normally detected on the basis of large values of the sample Mahalanobis distances \(\mathbb{(x_i-\overline{x})'} S^{-1} \mathbb{(x_i-\overline{x})}\). Largeness is decided on the basis of an appropriate percentile of the chi-square distribution with \(13\) degrees of freedom. In our case, we will declare as outliers all those points which exceed the upper \(0.5\) percentile of the \(\chi^2\) distribution, which is \(29.81947\).
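A sketch of this computation (Python, with simulated stand-in data; the cut-off \(29.81947\) is the \(99.5^{th}\) percentile of \(\chi^2_{13}\)):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
# Hypothetical stand-in for the 178 x 13 data matrix (illustration only)
X = rng.standard_normal((178, 13))

xbar = X.mean(axis=0)
S_inv = np.linalg.inv(np.cov(X, rowvar=False))
diffs = X - xbar
# Squared Mahalanobis distance of each observation
d2 = np.einsum('ij,jk,ik->i', diffs, S_inv, diffs)

# Cut-off: upper 0.5 percentile of chi-square with 13 df
cutoff = stats.chi2.ppf(0.995, df=13)
outliers = np.where(d2 > cutoff)[0] + 1   # 1-based serial numbers
print(round(cutoff, 2))  # matches the 29.81947 quoted above
```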

We present a plot which shows the sample distances corresponding to serial number of each of the sample point in \(x\) axis. The points exceeding the cut-off value are marked in red.

The sample points with the following serial numbers appear as multivariate outliers:- \(70, 74, 96, 111, 122, 159\). Except for the \(60^{th}\) observation, which appeared as a univariate outlier in “Ash”, all other univariate outliers are also multivariate outliers. The Mahalanobis distance corresponding to sample observation no. \(60\) is \(27.63\), which is also close to our chi-square cut-off value, so we are discarding this observation as well. Finally, we have identified \(7\) observations as extraordinary or outlying from the rest of the data. We delete these observations as their number is very small compared to our total sample size of \(178\). For our further normality checking and classification procedure, we will be dealing with the reduced data set from now onwards.

Hoping that we have cleaned our data to a large extent we now redo our checking of multivariate normality for this reduced data. The multivariate shapiro test value now has the following result:-

## 
##  Shapiro-Wilk normality test
## 
## data:  Z
## W = 0.90803, p-value = 7.232e-09

The \(p\) value, though it increases compared to that of the original dataset (which was of the order of \(10^{-13}\)), is still very small, and the test rejects the null hypothesis of normality. The chi-square \(Q-Q\) plot for the reduced data is shown below:-

With the removal of outliers, the plot now shows no indication of extreme observations. But there are still ample deviations. Though the sample quantiles initially agree with the theoretical ones, after a certain point they show an increasing departure. Thus we cannot expect normality in our reduced dataset either. The univariate \(Q-Q\) plots for each of the variables of this dataset are shown below:-

An enlarged view of the above plots (which can be obtained by running the R code available in our .rmd file separately in the console) shows clear indication of non-normality for all but a few variables. In the majority of cases the sample quantiles show a pattern (as a whole or sometimes in the tails) which does not match the theoretical quantiles.

It is clear that the reduced data shows no sign of normality. We can still check the homogeneity of covariance matrices using Box’s M test. The result is as follows:-

## 
##  Box's M-test for Homogeneity of Covariance Matrices
## 
## data:  Onwine4[, 2:14]
## Chi-Sq (approx.) = 589.07, df = 182, p-value < 2.2e-16

There is no change in the \(p\) value, which is still of the order of \(10^{-16}\). So the data still remain non-homogeneous with respect to the different groups, and there is no scope for doing any discriminant analysis on these data. We now move to a transformation of the data, which might give a better result.

6.1.2 Transformation of variables

Box and Cox proposed a family of power transformations known as the “Box-Cox transformation” which seems helpful in transforming a variable to near normality. Its generalization to the multivariate case in fact applies this transformation to each of the variables. In the context of our problem, it chooses a vector of \(13\) parameters \(\mathbb{\underline\lambda}=(\lambda_1, \lambda_2, \dots, \lambda_{13})\) in such a way that, for each \(k\), \[\begin{equation} l(\lambda_k)=\frac{-n}{2} \log\left(\frac{1}{n} \sum_{j=1}^n \left(x_{jk}^{\left(\lambda_k\right)} - \overline{x_{k}^{\left(\lambda_k\right)} } \right)^2 \right)+(\lambda_k-1) \sum_{j=1}^n \log x_{jk} \end{equation}\] becomes maximum over \(\lambda_k\). Here \(x_{jk}\) is the \(j^{th}\) observation of the \(k^{th}\) variable, \(j=1,2, \dots, n(=171)\); \(k=1,2,\dots, 13\). For any observation \(x\) the transformed value \(x^{\left(\lambda\right)}\) is \(\frac{x^{\lambda}-1}{\lambda}\) or \(\log x\) according as \(\lambda\) is non-zero or zero. This procedure tries to make each variable marginally normal, but marginal normality may not result in multivariate normality. Hence there is an iterative procedure which takes the initial \(\underline \lambda\) obtained from maximizing the above equation and iteratively maximizes \(\frac{-n}{2} \log|S(\underline \lambda)| +\sum_{i=1}^{13} (\lambda_i-1) \sum_{j=1}^n \log x_{ji}\), where \(S(\underline \lambda)\) is the covariance matrix of the transformed observations. Upon maximizing for each variable separately we have the following estimated values of \(\mathbb {\underline \lambda}\):-

##  [1]  1.4686 -0.2652  1.1136  0.8429 -0.8608  0.6526  0.8022  0.2445  0.6574
## [10]  0.1444  0.8634  1.3790 -0.0960

The value of the quantity \(l(\mathbb{\underline \lambda})\) for this choice of \(\mathbb{\underline \lambda}\) is \(147.4028\).

The iterative procedure results in the following estimated \(\mathbb{\underline \lambda}\):-

##  [1]  1.46860 -0.26520  1.11360  0.84290 -0.86080  0.65260  0.80220  0.24450
##  [9]  0.65740  0.14440  0.86340  1.37900  0.05086

For these choices of \(\lambda_k\) the value of \(l(\mathbb{\underline \lambda})\) comes out as \(148.938\), which is slightly higher than for the earlier choice. So we will proceed with this transformation.
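The per-variable step of this procedure is widely implemented; as an illustration (using Python's scipy rather than the R code used in the report), for a single positive-valued variable:

```python
import numpy as np
from scipy import stats, special

rng = np.random.default_rng(5)
# Hypothetical positive-valued column, a stand-in for one wine variable
x = rng.lognormal(mean=1.0, sigma=0.4, size=171)

# Per-variable step: maximize the profile log-likelihood over lambda_k
x_t, lam = stats.boxcox(x)

# The transform itself: (x^lam - 1)/lam for lam != 0, log(x) for lam == 0
assert np.allclose(x_t, special.boxcox(x, lam))
print(round(lam, 3))
```

For lognormal input, the estimated \(\lambda\) should sit near \(0\), since \(\log x\) is exactly normal there.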

We have thus applied the Box-Cox transformation to the \(k^{th}\) variable of our reduced dataset with \(\lambda_k\) as the parameter. The first few rows of our transformed dataset are presented below:-

Group Alcohol Malic acid Ash Alkalinity of ash Magnesium Total phenols Flavanoids Nonflavanoid phenols Proanthocyanins Color intensity Hue OD/OD Proline
1 32.94647 0.5000752 1.515704 10.833736 1.143757 1.467993 1.810904 -1.0939142 1.1014204 1.965115 0.0398923 4.045483 8.366347
1 29.43328 0.5346900 1.197183 7.904581 1.139655 1.362099 1.568015 -1.1477123 0.2680200 1.646388 0.0498324 3.195370 8.346134
1 29.29936 0.7679127 1.782622 12.754699 1.139844 1.467993 1.954361 -1.0429457 1.4790535 1.974192 0.0299392 2.834398 8.518960
1 33.43346 0.6120324 1.593258 11.608527 1.141858 2.161053 2.151025 -1.2047340 1.0179081 2.391260 -0.1414155 3.275097 8.839384
1 29.56739 0.8410645 2.007158 14.256297 1.142584 1.467993 1.510605 -0.8410789 0.7338285 1.629332 0.0398923 2.468184 7.842637
1 32.84241 0.5249782 1.537837 10.573419 1.141705 1.787734 2.072705 -0.9482577 0.8543417 2.198771 0.0498324 2.348574 8.809714
1 33.50321 0.5767451 1.537837 10.180910 1.138867 1.254101 1.369929 -1.0429457 0.8622619 1.873600 0.0199729 3.484433 8.640907
1 32.35814 0.6927732 1.715627 12.120204 1.142993 1.326341 1.361596 -1.0184191 0.2403411 1.824390 0.0597595 3.484433 8.646476
1 35.04919 0.4636195 1.229917 9.785859 1.139070 1.467993 1.746614 -1.0680979 0.8622619 1.861450 0.0795756 2.348574 8.339335
1 31.67025 0.2884729 1.339399 11.093006 1.139269 1.592498 1.882835 -1.2654672 0.7581955 2.287888 0.0099932 3.435865 8.339335

Because of the transformation we can now see the changes in the magnitudes of the different variables. Specifically, the “Proline” column no longer has very high values.

After obtaining the transformed data, we now proceed to check the assumptions of normality again. The multivariate Shapiro-Wilk test has the following result:-

## 
##  Shapiro-Wilk normality test
## 
## data:  Z
## W = 0.92541, p-value = 1.055e-07

There has been a slight improvement in the \(p\) value: it increases a bit and is now of the order of \(10^{-7}\), as opposed to the \(10^{-9}\) we noticed for the untransformed dataset. But it is still far too low to infer anything in favour of normality. The chi-square \(Q-Q\) plot is shown below:-

The \(Q-Q\) plot also has not changed a lot and still shows sample quantiles higher than the theoretical ones in the last part of the plot. Let us have a look at the univariate \(Q-Q\) plots:-

Despite some minor improvements in some of the individual plots, the variables which earlier showed significant deviation from the straight line still do so. In other words, not all variables have been successfully transformed to a univariate normal distribution. Thus even this transformation could not help us achieve normality. Box’s M test of homogeneity of covariance matrices gives the following output:-

## 
##  Box's M-test for Homogeneity of Covariance Matrices
## 
## data:  Otwine[, 2:14]
## Chi-Sq (approx.) = 546.41, df = 182, p-value < 2.2e-16

There is no change in the \(p\) value, and hence we still cannot proceed with Fisher’s discriminant analysis.

6.1.3 Dimension Reduction in terms of Principal Components

We have seen that discarding the outliers and transforming the observations were of little to no help for our purpose. Therefore, we are going to represent the reduced data (both transformed and untransformed) in terms of principal components. The section on “Principal Components” has already dealt with most aspects of the topic in the context of the original data. Principal components have their own significance in explaining the variability of a dataset in terms of a few components. By doing so, they in fact reduce the dimension of the data from \(n\) observations on \(p\) variables to \(n\) observations on \(k\) chosen principal components. \(k\) is often less than \(p\), and hence this procedure helps reduce the curse of dimensionality when \(p\) is large. However, reduction in dimension is not the reason we are focusing on them. Principal components are linear combinations of the observations on the original variables. So, if \(p\) is large, it is expected that the components will be approximately normally distributed. Perhaps, then, reducing the dataset to some components might help us achieve normality. For that purpose we have performed principal component analysis on our reduced (untransformed) data. Firstly we show the results obtained from the eigenvalue decomposition of \(S\), the covariance matrix:-

## Importance of components:
##                             Comp.1       Comp.2       Comp.3       Comp.4
## Standard deviation     317.9896751 11.403274837 2.830221e+00 2.119307e+00
## Proportion of Variance   0.9985681  0.001284135 7.910287e-05 4.435468e-05
## Cumulative Proportion    0.9985681  0.999852232 9.999313e-01 9.999757e-01
##                              Comp.5       Comp.6       Comp.7       Comp.8
## Standard deviation     1.036318e+00 8.867094e-01 5.045662e-01 3.487665e-01
## Proportion of Variance 1.060566e-05 7.764515e-06 2.514131e-06 1.201216e-06
## Cumulative Proportion  9.999863e-01 9.999941e-01 9.999966e-01 9.999978e-01
##                              Comp.9      Comp.10      Comp.11      Comp.12
## Standard deviation     3.256605e-01 2.398112e-01 1.820484e-01 1.440342e-01
## Proportion of Variance 1.047326e-06 5.679243e-07 3.272844e-07 2.048721e-07
## Cumulative Proportion  9.999988e-01 9.999994e-01 9.999997e-01 9.999999e-01
##                             Comp.13
## Standard deviation     8.859829e-02
## Proportion of Variance 7.751797e-08
## Cumulative Proportion  1.000000e+00
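The overwhelming first component above is a scale artifact: one variable (Proline, measured in the hundreds) dominates \(S\). A small Python sketch of the effect with simulated data (illustration only):

```python
import numpy as np

rng = np.random.default_rng(6)
# 13 hypothetical variables, one ("proline"-like) with a huge scale
X = rng.standard_normal((178, 13))
X[:, 12] *= 300.0

def pca_var_explained(M):
    """Proportion of total variance carried by each eigenvalue of M."""
    vals = np.sort(np.linalg.eigvalsh(M))[::-1]
    return vals / vals.sum()

prop_S = pca_var_explained(np.cov(X, rowvar=False))        # covariance-based
prop_R = pca_var_explained(np.corrcoef(X, rowvar=False))   # correlation-based

print(round(prop_S[0], 3))  # close to 1: the big-scale variable dominates
print(round(prop_R[0], 3))  # modest: every variable contributes equally
```

This is exactly why the analysis below switches to the correlation matrix \(R\): standardizing the variables removes the scale dominance.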

The scree-plot is shown below:-

Both from the R output and the scree-plot it is observed that the first component alone explains a very large amount of variation in the data: more than \(99\%\) of the total variance. We now check the normality assumptions for the first sample principal component. The Shapiro-Wilk test has the following result:-

## 
##  Shapiro-Wilk normality test
## 
## data:  Oprin1$scores[, 1]
## W = 0.93063, p-value = 2.528e-07

The Shapiro test clearly rejects the null hypothesis of normality for the first principal component. The \(Q-Q\) plot of the component is shown below:-

We can see that the \(Q-Q\) plot shows serious departure from the straight line. Both results indicate obvious non-normality in the first principal component scores. Let us use Bartlett’s test to check the homogeneity of variances of this component across the different groups:-

## 
##  Bartlett test of homogeneity of variances
## 
## data:  Opc_matrix1[, 2] and Opc_matrix1[, 1]
## Bartlett's K-squared = 23.863, df = 2, p-value = 6.581e-06

The test rejects the homogeneity of the component across the different groups, so we will not proceed with this component. Rather, we now see what happens if we use the eigendecomposition of the correlation matrix \(R\) in place of \(S\). The result and the scree plot are shown below:-

## Importance of components:
##                           Comp.1    Comp.2    Comp.3     Comp.4     Comp.5
## Standard deviation     2.2256349 1.6202434 1.1409521 0.95268344 0.87468932
## Proportion of Variance 0.3810347 0.2019376 0.1001363 0.06981583 0.05885242
## Cumulative Proportion  0.3810347 0.5829723 0.6831086 0.75292440 0.81177681
##                            Comp.6     Comp.7    Comp.8     Comp.9    Comp.10
## Standard deviation     0.77575213 0.70271180 0.5793337 0.54060358 0.50790770
## Proportion of Variance 0.04629164 0.03798491 0.0258175 0.02248094 0.01984386
## Cumulative Proportion  0.85806846 0.89605337 0.9218709 0.94435181 0.96419567
##                           Comp.11    Comp.12     Comp.13
## Standard deviation     0.46840744 0.40752918 0.282790844
## Proportion of Variance 0.01687735 0.01277539 0.006151589
## Cumulative Proportion  0.98107302 0.99384841 1.000000000

Perhaps we can keep the first \(4\) or \(5\) components, which explain around \(75\%\) and \(81\%\) of the total variation respectively. The R output of the Shapiro-Wilk test for the first four components is as follows:-

## 
##  Shapiro-Wilk normality test
## 
## data:  Z
## W = 0.96617, p-value = 0.0003561

For the first \(5\) components we have the following:-

## 
##  Shapiro-Wilk normality test
## 
## data:  Z
## W = 0.97011, p-value = 0.0009592

Both \(p\) values are of the order of \(10^{-4}\). The improvement on using principal components is clear, but normality is probably still not achieved as required.

The chi-square \(Q-Q\) plots for the \(4\) and \(5\) components are shown in the following figure on the left and right respectively.

As we can see, none of the plots gives any indication of normality. We can also look at the univariate \(Q-Q\) plots for each of the five components:-

The first, third and fifth components deviate greatly from the expected straight-line pattern, and the first two components are heavily non-normal compared to the others. This is not a good sign, as we would basically want to work with the first few components. Thus we probably cannot proceed with these components assuming they are normal, as there are many indications of their non-normal nature. Next we look at the results of the hypothesis test of homogeneity of covariance matrices. For \(4\) components the result is as follows:-

## 
##  Box's M-test for Homogeneity of Covariance Matrices
## 
## data:  Opc_matrix2[, 2:5]
## Chi-Sq (approx.) = 115.56, df = 20, p-value = 1.883e-15

For \(5\) components the result of the same test is stated below:-

## 
##  Box's M-test for Homogeneity of Covariance Matrices
## 
## data:  Opc_matrix2[, 2:6]
## Chi-Sq (approx.) = 153.05, df = 30, p-value < 2.2e-16

As earlier, both \(p\) values are very small, which shows that we cannot use Fisher’s Discriminant Analysis with these components.

Thus we have seen that even after reducing the dimension we are still in the same situation, with no tool to proceed for classification. As a last hope we can now repeat the above procedures with the transformed data. As we said earlier, these components are just linear combinations of the original variables, and the transformation tries to make each variable normal. Linear combinations of independent normal variables are again normal. We have already seen that using the principal components of the original variables improves the \(p\) value of the multivariate Shapiro-Wilk test. So perhaps we can expect more if we do the same thing on the principal components of the transformed data. Firstly, let us see the results of the principal component analysis for the correlation matrix \(R\):-

## Importance of components:
##                           Comp.1    Comp.2    Comp.3     Comp.4     Comp.5
## Standard deviation     2.1910799 1.6776739 1.1483626 0.94716451 0.87293335
## Proportion of Variance 0.3692947 0.2165069 0.1014413 0.06900928 0.05861636
## Cumulative Proportion  0.3692947 0.5858016 0.6872429 0.75625213 0.81486849
##                            Comp.6    Comp.7     Comp.8     Comp.9    Comp.10
## Standard deviation     0.76259381 0.6859717 0.57899374 0.56018958 0.50013768
## Proportion of Variance 0.04473456 0.0361967 0.02578721 0.02413941 0.01924136
## Cumulative Proportion  0.85960305 0.8957998 0.92158697 0.94572638 0.96496774
##                           Comp.11    Comp.12     Comp.13
## Standard deviation     0.46573376 0.38929306 0.294893735
## Proportion of Variance 0.01668523 0.01165762 0.006689409
## Cumulative Proportion  0.98165297 0.99331059 1.000000000

It can be seen that the first \(5\) components explain around \(81\%\) of the total sample variation. The multivariate Shapiro-Wilk test applied to these \(5\) components has the following result:-

## 
##  Shapiro-Wilk normality test
## 
## data:  Z
## W = 0.96048, p-value = 9.214e-05

The \(p\) value is still very small and the test rejects the hypothesis. The chi-square \(Q-Q\) plot of the \(5\) factor scores is shown below:-

The plot clearly shows a pattern different from the straight line. The univariate \(Q-Q\) plots are shown below:-

Components \(1\), \(2\) and \(5\) still show non-normality, deviating severely from the straight line. The test for homogeneity for these components has the following output:-

## 
##  Box's M-test for Homogeneity of Covariance Matrices
## 
## data:  Otprin2_scores[, 2:6]
## Chi-Sq (approx.) = 153.35, df = 30, p-value < 2.2e-16

Similar to our earlier results, the data are not homogeneous across the groups. Now let us take the covariance matrix \(S\) to compute the principal components. The R output and the scree-plot are presented below:-

## Importance of components:
##                          Comp.1    Comp.2     Comp.3     Comp.4     Comp.5
## Standard deviation     2.903246 1.8499395 1.16679304 0.51396409 0.40430272
## Proportion of Variance 0.599087 0.2432413 0.09676313 0.01877534 0.01161811
## Cumulative Proportion  0.599087 0.8423283 0.93909139 0.95786673 0.96948484
##                             Comp.6      Comp.7      Comp.8      Comp.9
## Standard deviation     0.351658245 0.293008717 0.279125757 0.232285453
## Proportion of Variance 0.008789494 0.006102156 0.005537606 0.003835008
## Cumulative Proportion  0.978274333 0.984376489 0.989914095 0.993749103
##                            Comp.10     Comp.11     Comp.12      Comp.13
## Standard deviation     0.193589529 0.182490038 0.131010845 1.855627e-03
## Proportion of Variance 0.002663704 0.002367013 0.001219935 2.447392e-07
## Cumulative Proportion  0.996412807 0.998779820 0.999999755 1.000000e+00

We can observe a bend in the scree plot occurring at the \(4^{th}\) principal component. Moreover, the first four components explain about \(96\%\) of the total variance, while the proportions of variance explained by the fifth component onwards are small and more or less the same. Now we proceed to check the normality assumptions. The multivariate Shapiro-Wilk test for these four components is shown below:-

## 
##  Shapiro-Wilk normality test
## 
## data:  Z
## W = 0.99151, p-value = 0.4089

Perhaps our attempt did not go in vain. This time we see a marked improvement in the \(p\) value, which comes out at about \(0.41\). The test thus fails to reject the null hypothesis of normality. While this test gives us positive hope, we also look at graphical diagnostics such as \(Q-Q\) plots and pairwise contour plots. In the meantime, we mention that we could have considered the first three components instead of four, as they also explain a very good amount of the variation. However, the Shapiro-Wilk test then produces a lower \(p\) value (around \(0.23\)) compared to that of the \(4\)-component model. So we proceed with four components rather than three.

The chi-square \(Q-Q\) plot of the four components is shown below:-

This time we have added a pointwise \(95\%\) confidence band around the reference line. Except for a minor deviation in the left tail, most of the points fall within the confidence band. The plot has improved considerably and matches the conclusion of the Shapiro-Wilk test. Now we make the univariate \(Q-Q\) plots, with a similar confidence band added to each.
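A chi-square \(Q-Q\) plot of this kind can be produced in base R from the squared Mahalanobis distances of the scores. A sketch (with a simulated stand-in matrix `Z` for the \(171 \times 4\) score matrix; the report's own script may differ):

```r
# Chi-square Q-Q plot: ordered squared Mahalanobis distances against
# chi-square quantiles with df = number of components (here 4).
set.seed(2)
Z  <- matrix(rnorm(171 * 4), nrow = 171)          # stand-in component scores
d2 <- mahalanobis(Z, center = colMeans(Z), cov = cov(Z))
qq <- qchisq(ppoints(length(d2)), df = ncol(Z))   # theoretical quantiles
plot(qq, sort(d2), xlab = "Chi-square quantiles",
     ylab = "Ordered squared distances")
abline(0, 1)                                      # 45-degree reference line
```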

In our earlier results, we consistently saw that the first and second principal components showed no sign of normality. In contrast, all four components now follow the pattern of the straight line except possibly in the two tails, and most of the points lie within the confidence band. The \(p\) values of the Shapiro-Wilk test for the individual components are shown in the following table:-

Components p-value
Comp.1 0.0424
Comp.2 0.0336
Comp.3 0.0566
Comp.4 0.1170

We notice that all four tests fail to reject the null hypothesis at the \(1\%\) level of significance.
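The per-component \(p\) values can be obtained by applying `shapiro.test` column-wise to the score matrix (sketch; `Z` again stands in for the actual scores):

```r
# Univariate Shapiro-Wilk test applied to each principal component.
set.seed(3)
Z     <- matrix(rnorm(171 * 4), nrow = 171)       # stand-in component scores
pvals <- apply(Z, 2, function(z) shapiro.test(z)$p.value)
round(pvals, 4)                                   # one p value per component
```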

After checking univariate normality, we check bivariate normality by constructing chi-square \(Q-Q\) plots for each pair of components. The plots are shown below:-

Except for the plot of components \(2\) and \(3\), all plots more or less lie within the confidence band. That plot, along with minor deviations in the others, raises some doubt: failure to reject a hypothesis test does not automatically mean the null hypothesis can be accepted. Still, based on everything done so far, we have achieved a great deal. The hypothesis tests, the chi-square plot of multivariate normality, the univariate \(Q-Q\) plots and most of the bivariate chi-square plots all favour normality. Therefore, we deem the \(171\) observations on these \(4\) components approximately normal. At the least, this dataset should be feasible for classification and discrimination under the assumption of normality. We show a few data points on these \(4\) components below:-

Group Comp.1 Comp.2 Comp.3 Comp.4
1 5.0476951 -0.9225705 1.0740911 0.5974586
1 3.0452826 -4.4528441 -1.2720161 0.4064876
1 1.0160452 -0.7588923 1.1348959 -0.8762201
1 5.2765474 0.0353568 1.2155711 -0.4039675
1 0.3416225 0.7991359 1.2609255 -0.2242545
1 5.0376813 -0.7290545 -0.0051358 -0.5661131
1 5.6956956 -0.8686017 0.1566233 0.7033373
1 3.8503888 0.1748650 0.8753364 0.6108093
1 7.1159073 -0.2978988 -0.4421262 0.4311456
1 3.8072734 -1.1235500 0.6808833 0.0100637

At last we check for homogeneity of the covariance matrices. The results are shown below:-

## 
##  Box's M-test for Homogeneity of Covariance Matrices
## 
## data:  Otprin1_scores[, 2:5]
## Chi-Sq (approx.) = 92.399, df = 20, p-value = 2.814e-11

The \(p\) value still remains very low. Thus we will be able to use only the quadratic classification procedure on this new dataset. Classification and discrimination are described in the following section.
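Box's M test is not in base R (the output format suggests a package such as `biotools`). As a sketch, the statistic and its chi-square approximation can be computed directly; note that for \(p=4\) variables and \(g=3\) groups the degrees of freedom \(p(p+1)(g-1)/2 = 20\) agree with the output above:

```r
# Sketch of Box's M test for homogeneity of covariance matrices.
boxM_sketch <- function(X, grp) {
  grp <- factor(grp); g <- nlevels(grp); p <- ncol(X); n <- nrow(X)
  nk <- as.vector(table(grp))                       # group sizes
  Sk <- lapply(levels(grp), function(l) cov(X[grp == l, , drop = FALSE]))
  Sp <- Reduce(`+`, Map(function(S, m) (m - 1) * S, Sk, nk)) / (n - g)
  M  <- (n - g) * log(det(Sp)) -
        sum((nk - 1) * sapply(Sk, function(S) log(det(S))))
  c1 <- (sum(1 / (nk - 1)) - 1 / (n - g)) *
        (2 * p^2 + 3 * p - 1) / (6 * (p + 1) * (g - 1))
  stat <- M * (1 - c1)                              # chi-square approximation
  df   <- p * (p + 1) * (g - 1) / 2
  c(chisq = stat, df = df, p.value = pchisq(stat, df, lower.tail = FALSE))
}
```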

6.2 Quadratic Classification Rule

Under the assumption of normality, classification with several populations can be done with either a linear or a quadratic classification rule. However, the linear rule assumes homogeneity of the covariance matrices, which the quadratic rule does not. So we will use the quadratic classification procedure on our data. As noted in the earlier section, we have already formed a representative dataset of the original data in terms of principal components. Based on the conclusion of the previous section, we assume the distribution of the four principal components is \(4\)-variate multivariate normal for each of the three groups. We observe below their nature of bivariate normality.

From the above ellipses we observe that the observations are more or less jointly bivariate normal. We also observe that PC1 and PC2 discriminate the groups very well compared to the other pairs, which is expected. So we can proceed to perform QDA. Below we highlight the first two components, PC1 and PC2, groupwise.

As we commented before, they are more or less normal.

As we do not know the parameters, we use the sample discriminant scores. The quadratic classification procedure allocates an observation \(\mathbf{x}\) (in our case a \(4\)-component vector of principal component scores) to population \(k\) if \(\hat{d}_k^{Q}(\mathbf{x}) \ge \hat{d}_i^{Q}(\mathbf{x})\) for all \(i=1,2,3;\ i \neq k\), where \(\hat{d}_k^{Q}(\mathbf{x})=-\frac{1}{2} \left(\log |S_k|+ (\mathbf{x}-\overline{\mathbf{x}}_k)'S_k^{-1}(\mathbf{x}-\overline{\mathbf{x}}_k) \right) +\log(1/3)\). Here \(S_k\) and \(\overline{\mathbf{x}}_k\) are the sample covariance matrix and sample mean of the \(k^{th}\) population, and \(\log(1/3)\) is the log of the equal prior probability. The allocation is made so as to minimise the total probability of misclassification.
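The sample quadratic score can be coded directly; a minimal sketch in R (equal priors \(1/3\), as in the formula above):

```r
# Sample quadratic discriminant score for one group:
# dQ(x) = -0.5 * (log|S_k| + (x - xbar_k)' S_k^{-1} (x - xbar_k)) + log(1/3)
dQ <- function(x, xbar, S) {
  -0.5 * (log(det(S)) + mahalanobis(x, center = xbar, cov = S)) + log(1/3)
}
# Allocation: compute dQ under each of the three groups and take which.max.
```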

For each sample point we can calculate three scores, and the observation is allocated to the group for which the score is maximum. We have carried out this procedure to check whether the predicted group matches the actual group. For that purpose we present the confusion matrix of misclassification below.

##      [,1] [,2] [,3]
## [1,]   59    0    0
## [2,]    2   61    2
## [3,]    0    0   47

The first and third rows indicate that all the observations belonging to groups \(1\) and \(3\) have been correctly classified. However, two observations originally belonging to group \(2\) have been misclassified as group \(1\), and similarly two have been misclassified as group \(3\). All the rest are correctly predicted. Thus \(2+2=4\) of the \(171\) observations are misclassified, so the apparent error rate (APER) is \(\frac{4}{171}=0.0234\) and the classification rule performs very well, with an accuracy of \(97.7\%\). However, the APER tends to underestimate the actual error rate (AER), because the same dataset is used both to build the classification rule and to evaluate it. A better procedure is Lachenbruch’s “holdout” procedure, in which the \(i^{th}\) observation is classified by a rule built from the \(n-1\) observations that omit it. We can compute the misclassification rates as before. For our problem, the confusion matrix obtained with the holdout procedure is given below:-

##      [,1] [,2] [,3]
## [1,]   58    1    0
## [2,]    3   60    2
## [3,]    0    0   47

This time one observation originally belonging to group \(1\) has been misclassified as group \(2\). Three observations belonging to group \(2\) have been misclassified as group \(1\), and two as group \(3\). Overall six observations have been misclassified, so the holdout error rate becomes \(\frac{6}{171}=0.0351\), which we can deem a nearly unbiased estimate of the AER. The accuracy is \(96.5\%\). So the quadratic classification rule performs very well in terms of both APER and AER.
In a comment in the earlier section we mentioned taking the first three components for further analysis. Because of the lower \(p\) value we did not do so and retained four components. But the first three components also explained a fair amount of the variation, so we can see how a quadratic classification rule performs using only the first three. The usual confusion matrix in that case is given below:-
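In R both error rates are typically obtained from `MASS::qda`: resubstitution via `predict`, and Lachenbruch's holdout via `CV = TRUE`. A sketch with simulated stand-in scores (group sizes as in the data; `prior = rep(1/3, 3)` matches the equal priors in the score function):

```r
library(MASS)                               # qda(); MASS ships with R
set.seed(12)
grp <- rep(1:3, times = c(59, 65, 47))      # group sizes as in the data
X   <- matrix(rnorm(171 * 4), 171, 4) + (grp - 2)  # crude group separation
fit  <- qda(X, grouping = grp, prior = rep(1/3, 3))
tab  <- table(pred = predict(fit)$class, actual = grp)
aper <- 1 - sum(diag(tab)) / sum(tab)       # apparent error rate
cv   <- qda(X, grouping = grp, prior = rep(1/3, 3), CV = TRUE)
tabH <- table(pred = cv$class, actual = grp)
aerH <- 1 - sum(diag(tabH)) / sum(tabH)     # holdout estimate of the AER
```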

##      [,1] [,2] [,3]
## [1,]   57    2    0
## [2,]    3   61    1
## [3,]    0    0   47

This time not all the elements of group \(1\) have been correctly classified. The total number of misclassified observations in group \(2\) remains the same, but the configuration changes. So the APER increases to \(\frac{6}{171}=0.0351\). The confusion matrix based on the leave-one-out “holdout” procedure is shown below:-

##      [,1] [,2] [,3]
## [1,]   57    2    0
## [2,]    3   60    2
## [3,]    0    0   47

This time too we see an increase in the number of misclassified units in group \(1\). The estimated AER thus increases a bit, to \(\frac{7}{171} \approx 0.041\). So in both respects the rule based on four components outperforms the rule based on the first three. However, the difference in accuracy is not large; if reducing the dimension matters more than this small gain in accuracy, we could classify using the first three components. We can also see how other combinations of three components perform in terms of APER and AER.

Components APER AER
1,2,4 0.0936 0.117
2,3,4 0.2281 0.269

Thus both combinations have higher error rates than our original rule (and also the previously considered three-component rule). The steep increase in error rates for the second combination suggests that the first principal component is essential for our classification purpose.

In our case it is not possible to display the quadratic classification regions, as our original rule uses four components. However, we can show the boundaries of the discriminant rules obtained by taking two components at a time. The following figures show the boundaries along with the apparent error rates at the top. Observations are marked as \(1\), \(2\) or \(3\) according to their classification, and misclassified units are marked in red.

The rule based on components \(1\) and \(2\) has the lowest error rate. This gives the impression that the components taking the lead role in explaining the variability in the data are also the ones that classify it most accurately. Moreover, all the plots that include component \(1\) perform better than the rest. So we can try a quadratic rule based on the first component alone, which seems to have the highest impact on classification. Such a rule has both APER and AER equal to \(0.26\). The error rate is high because we are classifying on observations of only one component, yet it performs better than most of the pairwise rules that exclude the first component. Our next section deals further with classification based on a single variable.

6.3 Classification Rule based on Density Estimate

We have already stated that a discrimination procedure either assumes normality or requires the true density to be known. All our earlier attempts, involving transformation of the variables, dimension reduction with principal components and so on, were aimed at achieving normality. The basis of our classification was the minimum expected cost of misclassification (ECM). The exact minimum-ECM rule involves the true density of each population: if the misclassification costs are equal, it allocates an observation \(\mathbf{x_0}\) to population \(\pi_k\) if \(p_k f_k(\mathbf{x_0})>p_i f_i(\mathbf{x_0})\) for all \(i \neq k\), where \(f_i\) and \(p_i\) are the true density and prior probability of the \(i^{th}\) population. The linear and quadratic rules are derived from this rule when every \(f_i\) is multivariate normal.
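With equal costs the rule is simply "assign \(\mathbf{x_0}\) to the group maximising \(p_k f_k(\mathbf{x_0})\)"; a minimal sketch, with two hypothetical normal populations as an example:

```r
# Minimum-ECM allocation with equal misclassification costs:
# pick the population k maximising prior_k * f_k(x).
classify_ecm <- function(x, dens, priors) {
  # dens: list of density functions f_k; priors: vector of prior probs p_k
  which.max(priors * vapply(dens, function(f) f(x), numeric(1)))
}
# Example with hypothetical normal populations centred at 0 and 5:
dens <- list(function(x) dnorm(x, 0), function(x) dnorm(x, 5))
classify_ecm(1, dens, c(0.5, 0.5))   # allocated to population 1
```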

In most cases it is difficult to know the true density. However, certain methods of non-parametric inference help us estimate it; kernel density estimation is one such method. If \(X_1,X_2, \dots, X_n\) are a random sample from a density \(f\), then the kernel density estimate based on this sample is \(\hat{f}_h(x)=\frac{1}{nh} \sum_{i=1}^n K\left(\frac{x-X_i}{h}\right)\), where \(K\) is a density on the real line (often referred to as the kernel) and \(h\) is the bandwidth, which generally depends on \(n\) in such a way that \(h \to 0\) as \(n \to \infty\). This gives us the opportunity to use kernel estimates in the minimum-ECM rule. But we have \(13\) variables, and we have not come across methods dealing with such estimates in high dimension. Thus we will estimate the density of one variable at a time and use it to form our classification rule, so ultimately we classify on a single variable. We use the optimal (Epanechnikov) kernel, given by \(K(x)=\frac{3}{4}(1-x^2)\) if \(|x|<1\) and \(K(x)=0\) otherwise. The optimal bandwidth is normally of order \(C n^{-\frac{1}{5}}\), where \(C\) is a constant depending on the true density \(f\). If \(f\) is normal, the optimal bandwidth is \(h=1.06 \hat{\sigma}n^{-\frac{1}{5}}\), where \(\hat{\sigma}\) is estimated from the data. We will use the four principal components as well as the transformed variables from the reduced data for classification. The components have already been seen to be approximately univariate normal; the transformed variables, though not all of them fully achieved normality, will be assumed univariate normal. Under these circumstances we can use the above value of \(h\).
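Base R's `density` supports the Epanechnikov kernel directly; a sketch of one group's estimate with the normal-reference bandwidth (note that `density`'s `bw` argument is the kernel's standard deviation, so passing \(h\) directly is only an approximation to the formula above):

```r
# Kernel density estimate with the Epanechnikov kernel and the
# normal-reference bandwidth h = 1.06 * sigma_hat * n^(-1/5).
set.seed(13)
x   <- rnorm(59)                         # stand-in for one group's variable
h   <- 1.06 * sd(x) * length(x)^(-1/5)   # normal-reference bandwidth
est <- density(x, bw = h, kernel = "epanechnikov")
plot(est, main = "Kernel density estimate")  # one curve per group in the report
```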

Firstly let us have a look at the APER and AER’s of the density estimate based rule applied on each of the four components:-

Component APER AER
1 0.2632 0.2865
2 0.3801 0.4035
3 0.4386 0.5029
4 0.5497 0.5906

We see that as we move away from the first component, the APER and AER increase. This again matches our earlier claim that component \(1\) does very well compared to the others. We also note that the AER for component \(1\) increases to \(0.286\) compared with the quadratic rule based on that component, so the quadratic rule is perhaps slightly better. We can also plot the density estimates for each population to see visually which component discriminates best. The plots are presented below:-

From the plots it also becomes clear that the overall separation is better in case of component \(1\). Component \(2\) and \(3\) also seem to discriminate group \(1\) and group \(3\) well. On the other hand, component \(4\) does not show that much discriminating power.

Next we find the APER and AER of the density-estimate-based rule for each of the transformed variables, presented below:-

Variables APER AER
Alcohol 0.3275 0.3567
Malic acid 0.3509 0.3684
Ash 0.5088 0.5848
Alkalinity of ash 0.4152 0.4678
Magnesium 0.4269 0.4678
Total phenols 0.3099 0.3275
Flavanoids 0.1696 0.1871
Nonflavanoid phenols 0.4444 0.4444
Proanthocyanins 0.386 0.4035
Color intensity 0.2515 0.2865
Hue 0.3041 0.386
OD/OD 0.2632 0.2865
Proline 0.269 0.3099

The table gives some impressive results. “Flavanoids” has the lowest error rates (both AER and APER) among all the variables; in fact it performs better than both the quadratic rule and the density-based rule built on principal component \(1\). Other variables producing error rates broadly similar to component \(1\) are “Proline”, “OD/OD” and “Color Intensity”. We show the density estimate plots of these four variables below:-

As expected, the density plot of “Flavanoids” discriminates the populations very well, with especially high power in separating group \(1\) from \(3\). The other variables also do well. For example, “Color Intensity” separates group \(2\) from both \(1\) and \(3\). “OD/OD”, on the other hand, separates group \(3\) from \(2\) and \(1\) but probably confuses the latter two. “Proline” is good at separating groups \(1\) and \(2\).